# Vision-Language Understanding
Skywork VL Reward 7B
MIT
Skywork-VL-Reward-7B is a 7B-parameter multimodal reward model built on the Qwen2.5-VL-7B-Instruct architecture, with a value head added on top for reward-model training (see the sketch after this entry).
Multimodal Fusion
Transformers

Skywork · 30 · 8

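The value head mentioned above can be pictured as a scalar regression head on top of the backbone's final hidden state. The sketch below is illustrative only; the class name, pooling choice, and field names are assumptions, not Skywork's actual implementation.

```python
# Illustrative value-head reward model on a VLM backbone (assumed design,
# not Skywork's actual code).
import torch
import torch.nn as nn

class VLRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # e.g. a Qwen2.5-VL-style decoder
        self.value_head = nn.Linear(hidden_size, 1)   # maps a hidden state to a scalar reward

    def forward(self, **inputs) -> torch.Tensor:
        # Summarize the (image, prompt, response) sequence with the last token's
        # hidden state, then score it with the value head.
        out = self.backbone(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]
        return self.value_head(last_hidden).squeeze(-1)   # shape: (batch,)
```
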
Llama 3.2 11B Vision Radiology Mini
A multimodal model based on the Llama 3.2 Vision architecture that follows vision-and-text instructions and is optimized with 4-bit quantization (see the loading sketch after this entry).
Image-to-Text
p4rzvl · 69 · 0

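As a rough illustration, a Llama-3.2-Vision-style checkpoint can be loaded in 4-bit with bitsandbytes through transformers. The repository id below is a placeholder, and if the published weights are already pre-quantized the `quantization_config` argument may be unnecessary.

```python
# Hedged sketch: loading a Llama-3.2-Vision-style checkpoint in 4-bit.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "your-org/llama-3.2-11b-vision-radiology-mini"  # placeholder repo id
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # skip if the checkpoint ships pre-quantized weights
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```
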
Internvl3 78B Pretrained
Other
InternVL3-78B is an advanced multimodal large language model developed by OpenGVLab, demonstrating excellent overall performance. Compared with its predecessor, InternVL 2.5, it offers stronger multimodal perception and reasoning and extends to new domains such as tool use, GUI agents, industrial image analysis, and 3D visual perception.
Image-to-Text
Transformers · Other

OpenGVLab · 22 · 1

Vora 7B Base
VoRA is a 7B-parameter vision-language model that takes image and text inputs and generates text outputs.
Image-to-Text
Transformers

Hon-Wong · 62 · 4

Internvl2 5 HiMTok 8B
Apache-2.0
HiMTok is a hierarchical mask token learning framework fine-tuned from the InternVL2_5-8B large multimodal model, focused on image segmentation tasks.
Image-to-Text
yayafengzi · 16 · 3

Mmmamba Linear
MIT
mmMamba-linear is the first decoder-only multimodal state-space model to achieve quadratic-to-linear distillation using moderate academic compute, offering efficient multimodal processing.
Image-to-Text
Transformers

hustvl · 16 · 3

Minivla Vq Libero90 Prismatic
MIT
MiniVLA is a lightweight vision-language model compatible with the Prismatic VLMs training framework, supporting image-and-text-to-text multimodal tasks.
Image-to-Text
Transformers · English

Stanford-ILIAD · 31 · 0

Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft that uses a prompt-based paradigm to handle a wide range of vision and vision-language tasks (see the captioning sketch after this entry).
Image-to-Text
Transformers

zhangfaen · 14 · 0

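For reference, Florence-2 selects the task entirely through a special prompt token. The sketch below follows the usage pattern documented for the upstream microsoft/Florence-2-large-ft checkpoint; the image URL is a placeholder, and details may differ for this mirror.

```python
# Prompt-based captioning with Florence-2 (pattern from the upstream model card).
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large-ft"  # upstream checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
prompt = "<CAPTION>"  # the task is chosen entirely by the prompt token

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task="<CAPTION>", image_size=(image.width, image.height)))
```
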
Denseconnector V1.5 8B
DenseConnector is an open-source chatbot fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text
Transformers

HuanjinYao · 17 · 7

Llava Next Mistral 7b 4096
A multimodal model fine-tuned from LLaVA-v1.6-Mistral-7B, supporting joint image-text understanding and text generation.
Image-to-Text
Transformers

Mantis-VL · 40 · 2

Kosmos 2 Patch14 24 Dup Ms
MIT
Kosmos-2 is a multimodal large language model that integrates visual information with language understanding, supporting image-to-text generation and visual grounding tasks (see the sketch after this entry).
Image-to-Text
Transformers

ishaangupta293 · 21 · 0

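For context, Kosmos-2's grounding behavior is triggered by a `<grounding>` prompt prefix. The sketch below follows the documented usage of the upstream microsoft/kosmos-2-patch14-224 checkpoint, which this duplicate repository presumably mirrors; the image URL is a placeholder.

```python
# Grounded captioning with Kosmos-2 (pattern from the upstream model card).
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "microsoft/kosmos-2-patch14-224"  # upstream checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus grounded phrases with boxes.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # [(phrase, (start, end), [(x1, y1, x2, y2), ...]), ...]
```
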
Tinyllava 3.1B
Apache-2.0
TinyLLaVA is a framework for small-scale large multimodal models that substantially reduces parameter count while maintaining strong performance; the 3.1B version outperforms comparable 7B-scale models on multiple benchmarks.
Image-to-Text
Transformers · Multilingual

bczhou · 184 · 26

Saved Model Git Base
MIT
A vision-language model fine-tuned from microsoft/git-base on an image-folder dataset, primarily used for image caption generation (see the sketch after this entry).
Image-to-Text
Transformers · Other

holipori · 13 · 0

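Since this checkpoint follows the GIT architecture, captioning can be sketched along the lines of the standard transformers GIT usage. The base repository id is used below as a stand-in for the fine-tuned checkpoint, and the image URL is hypothetical.

```python
# Image captioning with a GIT-style checkpoint (standard transformers GIT usage).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/git-base"  # or the fine-tuned checkpoint from this entry
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
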
Video Blip Opt 2.7b Ego4d
MIT
VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.
Video-to-Text
Transformers · English

kpyu · 429 · 16

Vilt B32 Mlm
Apache-2.0
ViLT is a vision-and-language Transformer pretrained on the GCC+SBU+COCO+VG datasets for joint image-text understanding; this checkpoint targets masked language modeling over image-text pairs (see the sketch after this entry).
Text-to-Image
Transformers

dandelin · 7,761 · 11

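To illustrate the joint image-text masked-language-modeling objective, the sketch below follows the standard transformers ViLT usage; the image URL is a placeholder.

```python
# Masked language modeling over an image-text pair with ViLT.
import requests
import torch
from PIL import Image
from transformers import ViltForMaskedLM, ViltProcessor

model_id = "dandelin/vilt-b32-mlm"
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForMaskedLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
text = "a photo of a [MASK] sitting on a couch"

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Replace the [MASK] token with the highest-scoring vocabulary entry.
mask_positions = (encoding.input_ids == processor.tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.logits[0, mask_positions].argmax(-1)
print(processor.tokenizer.decode(predicted_ids))
```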